The first part of Exploratory Data Analysis will be to - explore! We’ll start with exploring each variable and then pair things down.
Now we can start pairing the dataset down to about 10-20 variables. I am most interested in the parameters that characterize the borrower and their loan.
After pairing the parameters to 14, I want to subset on those borrowers that have a loan.
Now that we have our first pass at the parameters of interest, I will explore them with univariate plots.
This looks like a long tailed data, so we will want to scale the x-axis. I also want to get the summary statistics on this parameter.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3500 6000 8472 11000 72500 91852
The mean loan amount is $8472, while the median amount is $6000. Let’s look at this parameter on a log scale.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 379 rows containing non-finite values (stat_bin).
## Warning: Removed 3 rows containing missing values (geom_bar).
The distribution looks more normal on a log scale. Nothing too outstanding from this plot other than the bins make it appear that there are more loans on the high end than the low end.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 13.40 18.40 19.28 25.00 49.75
## [1] 7.481843
Based on the websites information which shows borrower rates in the mid-single digits to mid double-digits, I am assuming that I need to scale the BorrowRate by 100. Therefore, the rate on the loans spans from 0.00 to 49.75 with a mean of 19.28. The standard deviation is 7.48.
The next two plots show the overall distribution of the return on investment that a lender can expect to make. In particular I zoomed in on the tail that fall under negative returns.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2288 rows containing non-finite values (stat_bin).
## Warning: Stacking not well defined when ymin != 0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 21943 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -18.270 7.408 9.170 9.607 11.660 28.370 29084
The range of returns spans from -18.27 to 28.37. While some of the negative returns are large, they are relatively few. The mean rate of return is 9.607% and the median is 9.17%, which is quite good in today’s low interest environment.
Let’s compare the number of loans that returned a loss to those that returned a gain.
## n
## 1 0.007173747
The ratio of negative returns to positive returns is 0.7% which seems quite low. Overall, it appears that the risk of lending on Prosper is pretty low.
Before converting the ProsperScore to an ordered factor, I plotted it and summarized it’s descriptive statistics.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2288 rows containing non-finite values (stat_bin).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 4.00 6.00 5.95 8.00 11.00 29084
The mean Prosper Score is 5.95 and the median is 6.00.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 699.0 704.6 739.0 899.0 591
The mean credit score is 704.6 and the median is 699.0. The distribution of lenders’ credit scores looks different compared to the distribution of the Prosper Score. The credit score has a handful of people at the low and high end of the range. Surely the two are related?!
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 6.00 9.00 9.26 12.00 54.00 7604
The average number of open credit lines is 9.26 while the median is 9.00. There are some people that have over 30 or 40 open credit lines, which seems like a lot.
## Warning: Removed 505 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3121 8549 17600 19520 1436000 7604
The mean revolving credit balance is $17,600 and the median is $8,549. Looks like there’s a few outliers that carry a revolving credit balance of over $250,000. I wonder if people with higher income generally have higher revolving balances.
## Warning: Removed 142 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 880 4100 11210 13180 646300 7544
## Warning: Removed 2220 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The mean of the debt-to-income ratio is 0.276 and the median is 0.220.
## 'data.frame': 22085 obs. of 16 variables:
## $ BorrowerRate : num 0.2085 0.2809 0.2225 0.0785 0.15 ...
## $ EstimatedReturn : num 0.0907 0.1248 0.0927 0.0555 NA ...
## $ ProsperRating..numeric. : int 3 2 3 7 NA 6 5 5 5 2 ...
## $ ProsperRating..Alpha. : Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: 3 2 3 7 NA 6 5 5 5 2 ...
## $ ProsperScore : num 4 4 8 10 NA 8 5 6 8 3 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 25 6 39 16 16 7 50 16 10 6 ...
## $ EmploymentStatusDuration: int 44 133 18 7 32 90 61 89 103 70 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 2 1 1 1 2 2 1 2 1 ...
## $ CreditScoreRangeUpper : int 699 659 699 779 679 699 739 739 679 619 ...
## $ OpenCreditLines : int 19 7 15 3 9 15 9 6 5 13 ...
## $ RevolvingCreditBalance : num 6193 14030 3140 41 4519 ...
## $ AvailableBankcardCredit : num 695 27 1987 3859 2031 ...
## $ DebtToIncomeRatio : num 0.26 0.15 0.41 0.05 0.27 0.35 0.16 0.25 0.03 0.28 ...
## $ IncomeRange : Ord.factor w/ 7 levels "NA"<"$0"<"$1-24,999"<..: 7 6 4 7 4 5 7 5 6 5 ...
## $ StatedMonthlyIncome : num 9583 8292 3075 13083 3167 ...
## $ ProsperPrincipalBorrowed: num 11000 10700 8000 4000 6000 27000 8000 15000 19000 6000 ...
I subset the original dataset to include only those individuals that have borrowed some amount through Prosper. I also downselected the number of variables of interest. In the end, there are 22,085 individuals with 14 variables. Most of these variables describe the characteristics of the borrower and so I want to see if I can use these to make an inference on one or more cohorts.
Other Observations: IncomeRange is an ordered factor variable: “NA”< “$0”< “$1-24,999”< “$25,000-49,999”< “$50,000-74,999”< “$75,000-99,999”< “$100,000+” The mean ProsperPrincipleBorrowed is: $8472 The mean ProsperScore is: 5.95 The mean BorrowerRate is: 19.28% The mean EstimatedReturn is: 9.607%
I am interested in the ProsperPrincipleBorrowed as well as the EstimatedReturn. I believe that other variables will be able to help predict one or both of these parameters.
I believe that EmploymentStatusDuration, IsBorrowerHomeowner, CreditScoreRangeUpper, OpenCreditLines, RevolvingCreditBalance, AvailableBankcardCredit, DebtToIncomeRatio, and StatedMonthlyIncome will help to predict either ProsperPrincipleBorrowed or EstimatedReturn.
Not yet.
The EstimatedReturn distribution has some outliers on both ends of the distribution and looks somewhat bimodal on a linear y axis. I used dplyr to select, subset, and order the variables in order to reduce the number of variables that I analyzed.
Now I am going to convert ProsperScore to an ordered factor because I believe (based on analysis later on) that this parameter will affect the EstimatedReturn.
Let’s crossplot all of the variables to get an idea of how they are related.
It looks like there’s a large and positive relationship between EstimatedReturn a BorrowerRate, which makes sense. Let’s come back to that in a little bit after we explore ProsperPrincipleBorrowed.
Facetwrapping Principle loan on borrower income…
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 379 rows containing non-finite values (stat_bin).
## Warning: Removed 21 rows containing missing values (geom_bar).
## pspr2$IncomeRange: NA
## NULL
## --------------------------------------------------------
## pspr2$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2512 5000 8374 10000 40000
## --------------------------------------------------------
## pspr2$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2000 4000 5147 6400 37000
## --------------------------------------------------------
## pspr2$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3000 5000 6449 8000 67000
## --------------------------------------------------------
## pspr2$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 3500 6000 8094 10400 65000
## --------------------------------------------------------
## pspr2$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 9791 13500 55900
## --------------------------------------------------------
## pspr2$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 5000 10000 12230 15500 72500
In general, the higher the income range the higher the median amount of principle borrowed - from $4k in the $1-24,999 income range all the way to $10k in the $100k+ income range. This is confirmed by the boxplot below.
Facetwrapping Principle loan on homeowner status.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 379 rows containing non-finite values (stat_bin).
## Warning: Removed 6 rows containing missing values (geom_bar).
## pspr2$IsBorrowerHomeowner: False
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3000 5000 7310 10000 67000
## --------------------------------------------------------
## pspr2$IsBorrowerHomeowner: True
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7000 9510 13000 72500
This boxplot shows that the median principle borrowed is higher for homeowners @ $7k vs non-homeowners @ $5k.
Borrower rates faceted on income…
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the boxplots, one can see that the borrower rate decreases, in general, with increasing income.
And then Home Owner.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the boxplot one can also see that borrower rates are lower for homeowners.
I want to look a two more variables, ProsperScore and ProsperRating..Alpha.
Clearly, the borrowed rate decreases with increasing prosper score.
Also, in general, the borrowed rate increases with ProsperRating.
Now going back to Estimated Return, let’s plot a histogram for lenders’ returns faceted by income range of borrower.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2288 rows containing non-finite values (stat_bin).
## Warning: Removed 2288 rows containing non-finite values (stat_boxplot).
There might be slight differences in the means of these groups but overall they look fairly similar.
Now let’s plot a scatterplot of returns vs borrower rate.
## Warning: Removed 104 rows containing missing values (geom_point).
There is definitely a positive correlation between these variables, but there’s some outliers that are worth investigating.
Let’s look at this more closely by including the full dataset.
## Warning: Removed 2288 rows containing missing values (geom_point).
It looks like there’s at least one more additional discrete variable that is stratifying the data. I will look into that during the multivariate analysis.
I first used ggplot to cross-plot all the variables against each other and then looked for any relationships that might indicate a relatively strong correlation. I noticed that EstimatedReturn and BorrowerRate had a strong correlation at 0.744, which makes sense.
Before jumping directly into understanding that relationship I wanted to make sure that I had a better understanding of the relationships between other variables. The following observations were made:
The higher the income range the higher the median amount of principle borrowed.
The median principle borrowed is higher for homeowners at $7k vs $5k for non-homeowners.
Median borrower interest rates decrease, in general, with increasing income. Homeowners also see lower rates.
Rates also drop with an increasing Prosper Score and, in general, decrease with an increase in the Prosper Rating.
Going back to EstimatedReturn and BorrowerRate, it was clear that I needed to dig into this relationship more closely, so I plotted a scatterplot of the two using a 1000-sampled dataset and noticed that while the majority of the points fell along a somewhat defined line there were some outliers. Once I plotted the full dataset it was clear that there was some stratification going on based upon some other variable. The next section will further delve into the variables that contribute to this stratification.
The relationship between income range and the median amount of principle borrowed was interesting. I don’t know whether it means that richer people are allowed to borrow more or if they typically engage in more capital intensive endeavors.
Looking at more relationships
## Warning: Removed 75 rows containing non-finite values (stat_density).
This is interesting as it shows that people who have a longer employment status typically make more money.
How does both EmploymentStatus and homeownership affect income?
## Warning: Removed 4 rows containing missing values (geom_point).
I’ve kept the axes essentially the same and it appears that people who own homes are more likely to make more money. However, this graph hides the discovery we made in the previous plot.
It appears that there is a positive correlation between the log(AvailableBankcardCredit) and OpenCreditLines.
This is kind of a strange one. A positive correlation between the log10 of Income and log10 of Revolving Credit. Here again, homeowners have higher income and more credit lines open.
Going back to EstimatedReturn, clearly there was a discrete stratification occurring based on some other parameter. This should serve as a clue as the parameter is likely an ordered factor variable. Now let’s see if we can identify the variable that is stratifying the EstimatedReturn. After trying a few different parameters it became evident that the ProsperScore and ProsperRating were influencing the stratefication.
Let’s see if we can build a linear model of EstimatedReturn based on this information.
##
## Calls:
## m1: lm(formula = I(EstimatedReturn * 100) ~ BorrowerRate, data = subset(pspr2,
## !is.na(EstimatedReturn)))
## m2: lm(formula = I(EstimatedReturn * 100) ~ BorrowerRate + ProsperScore,
## data = subset(pspr2, !is.na(EstimatedReturn)))
## m3: lm(formula = I(EstimatedReturn * 100) ~ BorrowerRate + ProsperScore +
## ProsperRating..Alpha., data = subset(pspr2, !is.na(EstimatedReturn)))
##
## ==============================================================
## m1 m2 m3
## --------------------------------------------------------------
## (Intercept) 3.366*** 2.574*** -8.346***
## (0.044) (0.077) (0.103)
## BorrowerRate 33.238*** 36.288*** 88.818***
## (0.212) (0.344) (0.497)
## ProsperScore: .L 2.358*** 1.456***
## (0.085) (0.057)
## ProsperScore: .Q -1.711*** -0.772***
## (0.063) (0.045)
## ProsperScore: .C 0.221*** 0.344***
## (0.061) (0.041)
## ProsperScore: ^4 -0.354*** -0.100**
## (0.057) (0.037)
## ProsperScore: ^5 0.509*** 0.201***
## (0.054) (0.035)
## ProsperScore: ^6 -0.125* -0.386***
## (0.054) (0.035)
## ProsperScore: ^7 0.146** 0.166***
## (0.053) (0.035)
## ProsperScore: ^8 -0.188*** -0.051
## (0.051) (0.033)
## ProsperRating..Alpha.: .L 13.849***
## (0.130)
## ProsperRating..Alpha.: .Q -3.483***
## (0.071)
## ProsperRating..Alpha.: .C 0.869***
## (0.056)
## ProsperRating..Alpha.: ^4 -0.511***
## (0.041)
## ProsperRating..Alpha.: ^5 -0.072*
## (0.032)
## ProsperRating..Alpha.: ^6 -0.011
## (0.028)
## --------------------------------------------------------------
## R-squared 0.6 0.5 0.8
## adj. R-squared 0.6 0.5 0.8
## sigma 2.3 2.4 1.5
## F 24546.2 1968.4 4408.4
## p 0.0 0.0 0.0
## Log-likelihood -44784.3 -39398.6 -31887.3
## Deviance 106912.1 96449.9 40459.9
## AIC 89574.6 78819.3 63808.6
## BIC 89598.3 78904.6 63940.5
## N 19797 17293 17293
## ==============================================================
The variables in this model account for ~80% of the variance in the estimated return.
Both the Prosper Score and the Prosper Rating were found to help explain the investor’s Estimated Return. The strongest clue in helping uncover the right variables was that the stratification was discrete and so implied an ordered factor variable(s). Once I plotted the Estimated Return while faceting on ProsperRating..Alpha. and coloring based on ProsperScore it was obvious that those two variables along with BorrowerRate helped explain the vast majority of EstimatedReturn’s values.
It should be noted that, while not obvious from the last plots, there does seem to be some variance unaccounted for that also appears to be caused by discrete stratification. After attempting to add the remaining ordered factor variables into my model there were no improvements in the R-squared value and so the final variable may be one that was deprecated in the initial round of selection.
While these features didn’t impact expected return, the density plot of Income Range against employment duration shows that people who have a longer employment status typically make more money.
Yes. The model does a fairly good job of predicting someone’s return when they lend money on Prosper given the R-squared value of 0.8. However, there is a small amount of variance unaccounted for that appears to be due to a factored variable. I was not able to identify this variable and believe that it may have been deprecated when initially pairing down the original number of features.
The mean rate of return is 9.607% and the median is 9.17%, which is quite good in today’s low interest environment. The distribution appears to be negatively skewed.
## Warning: Removed 2288 rows containing missing values (geom_point).
## Warning: Removed 15 rows containing non-finite values (stat_boxplot).
The Estimated Return is positively correlated with the Borrower Rate but there is variance that looks to be due to stratification from the Prosper Score. The Prosper Score is one factor that determines the borrower’s rate and, therefore, the lender’s return on investment. The lower the Prosper Score, the higher the borrower’s rate.
The Estimated Return on Investment is dependent on the Borrower Rate, the Prosper Score, and the Prosper Rating. All three of these factors explain about 80% of the variance in the data. Returns top out between ratings C, D, and E with a medium to high Prosper Score.
The Prosper dataset contains 113,937 observations and 87 variables. I started by reducing the number of variables down to 16 of the most interesting which pertained to the borrower and lender characteristics. I then explored some of the individual parameters which led to me reducing the sample size down to only those which had borrowed money on the platform. I started to focus on the Estimated Return for the lender because that is one of the most pertinent for those coming to the platform. As I started to plot the Estimated Return against several variables I started to notice that there were several variables that naturally popped out and explained the variance in the data. These variables included the Borrower Rate, Prosper Score, and Prosper Alpha Rating. While I didn’t know how the Prosper Score and Alpha Rating were derived for each borrower, it makes sense that these scores would affect both the lending rate and the return. I then made a linear model which incorporated these three variables and demonstrated that they could explain 80% of the variance in the data. I tried to search for other variables that would explain the last 20% but couldn’t find any out of the 13 left. I believe that I may have eliminated it when depricating the number of variables at the beginning of the analysis. I would be interested in both looking for the variable to explain the last 20% as well as further exploring how the Prosper Score and Alpha Rating were derived.
References http://stackoverflow.com/questions/4605206/drop-data-frame-columns-by-name https://s3.amazonaws.com/udacity-hosted-downloads/ud651/diamondsExample.html